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—This  work  considers  tree  algorithms  for  the  N-body  problem  where 
the  number  of  particles  is  on  the  order  of  a  million  The  main  concern 
of  this  work  is  the  organization  and  performance  of  these  computations 
on  parallel  computers. 

The  work  introduces  a  formulation  of  the  N-body  problem  as  a 
set  of  recursive  equations  based  on  a  few  elementary  functions.  It  is 
shown  that  both  the  algorithm  of  Barnes-Hut  and  that  of  Greengard- 
Rokhlin  satisfy  these  equations  using  different  elementary  functions. 
The  recursive  formulation  leads  directly  to  a  computational  structure 
in  the  form  of  a  pyramid-like  graph,  where  each  vertex  is  a  process, 
and  each  arc  a  communication  link. 

The  pyramid  is  mapped  to  three  different  processor  configurations: 
(1)  A  pyramid  of  processors  corresponding  to  the  processes  pyramid 
graph;  (2)  An  hypercube  of  processors,  e.g.,  a  connection-machine  like 
architecture.  (3)  A  rather  small  array,  e.g.,  2  x  2  x  2,  of  processors 
faster  then  the  ones  consider  in  (1)  and  (2)  above. 

The  main  conclusion  is  that  simulations  of  this  size  can  be  per¬ 
formed  on  any  of  the  three  architectures  in  reasonable  time.  20  sec¬ 
onds  per  time  step  is  the  estimate  for  a  million  equally  distributed 
particles  using  the  Greengard-Rokhiin  algorithm  on  the  CM-2  connec¬ 
tion  machine.  The  smaller  array  of  processors  is  quite  competitive  in 
performance.  *•*, , . 


V  - 

0  Accesion  For 

- - 

1 

^  NTIS  CRA’I 

y 

OTIC  TAB 

n 

□ 

V  t  1  '< 

t  |  . . 

b  !  e>  .  . 

V,  '  P.v.  :  '  ■ 

1  h- . -•  -• 

- J 

j 

j  . . 

■  •  •  ^  j 

S  y.  : 

i 

•>!  1 

VWi 


cop* 

v  n<5pec*tso 

v  6  y 


MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
ARTIFICIAL  INTELLIGENCE  LABORATORY 


v  ir.  trj  «Tu  w rjTrjVJWJV'.  r.v.VJ 


A/  Memo  1042 


April  1988 


Computational  Structure  of  the 
N-body  Problem 

Jacob  Katzenelson1 


■1 


Abstract 

This  work  considers  tree  algorithms  for  the  N-body  problem  where 
the  number  of  particles  is  on  the  order  of  a  million.  The  main  concern 
of  this  work  is  the  organization  and  performance  of  these  computations 
on  parallel  computers. 

The  work  introduces  a  formulation  of  the  N-body  problem  as  a 
set  of  recursive  equations  based  on  a  few  elementary  functions.  It  is 
shown  that  both  the  algorithm  of  Barnes-Hut  and  that  of  Greengard- 
Rokhlin  satisfy  these  equations  using  different  elementary  functions. 
The  recursive  formulation  leads  directly  to  a  computational  structure 
in  the  form  of  a  pyramid-like  graph,  where  each  vertex  is  a  process, 
and  each  arc  a  communication  link. 

The  pyramid  is  mapped  to  three  different  processor  configurations: 

(1)  A  pyramid  of  processors  corresponding  to  the  processes  pyramid 
graph;  (2)  An  hypercube  of  processors,  e.g.,  a  connection-machine  like 
architecture.  (3)  A  rather  small  array,  e.g.,  2  x  2  x  2,  of  processors 
faster  then  the  ones  consider  in  (1)  and  (2)  above. 

The  main  conclusion  is  that  simulations  of  this  size  can  be  per¬ 
formed  on  any  of  the  three  architectures  in  reasonable  time.  24  sec¬ 
onds  per  time  step  is  the  estimate  for  a  million  equally  distributed 
particles  using  the  Greengard-Rokhlin  algorithm  on  the  CM-2  con¬ 
nection  machine.  The  smaller  array  of  processors  is  quite  competitive 
in  performance. 

Keywords:  N-body  problem,  particle  simulation,  tree  algorithms 
for  particle  simulation,  parallel  computing. 
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1  Introduction 

The  study  of  physical  systems  by  particle  simulation  is  called  the  “many- 
body”  or  the  “N-body”  problem.  Such  studies  are  conducted  in  celestial 
mechanics,  plasma  physics,  fluid  mechanics  as  well  as  in  semiconductor  device 
simulation  [1].  The  simulation  finds  the  trajectories  of  each  particle  over  some 
time  interval,  given  the  initial  position,  the  initial  velocity,  the  external  force 
and  the  nature  of  the  forces  that  the  particles  exert  on  each  other. 

The  number  of  particles  used  for  such  studies  is  quite  large.  It  is  es¬ 
timated  that  to  get  insight  into  three-dimensional  turbulent  flow  about  I 
million  particles  are  needed  [2j.  Thus,  such  simulations  require  intensive  and 
prolonged  computations  and  the  question  of  efficiency  is  of  great  interest. 

This  work  considers  a  family  of  algorithms  for  calculating  the  force  or  the 
potential  $  which  can  be  called  “tree  algorithms”.  These  algorithms  reduce 
the  asymptotic  computational  complexity  from  0(N2)  in  the  naive  approach, 
where  each  particle  interacts  with  each  other  particle,  to  0(N log  N)  in  [5] 
[6],  and  down  to  0{N)  in  [7]  (for  particles  in  two  dimensions)  and  in  [9]  and 
[11]  (for  particles  in  three  dimensions).  The  main  goal  of  this  work  is  the 
organization  and  performance  of  such  algorithms  on  parallel  computers.  In 
particular,  it  considers  how  such  computers  should  be  organized  in  order  to 
handle  tree  algorithms  efficiently  when  the  number  of  particles  is  very  large, 
and  how  time  is  divided  between  computation  time  and  time  for  communi¬ 
cation  between  processors. 

The  work  introduces  a  formulation  of  the  many- body  problem  a s  a  set  of 
recursive  equations  based  on  a  few  elementary  functions.  It  is  shown  that  the 
algorithms  of  both  [6]  and  [7]  satisfy  these  equations  using  different  elemen¬ 
tary  functions.  The  recursive  formulation  leads  directly  to  a  computational 
structure  in  the  form  of  a  pyramid-like  graph,  where  each  vertex  is  a  process, 
and  to  a  SIMD  computational  algorithm. 

The  pyramid  is  mapped  to  three  different  processor  configurations: 

•  A  pyramid  of  processors  corresponding  to  the  processes  pyramid  graph, 
where  each  vertex  is  assigned  a  process  and  each  arc  —  a  communica¬ 
tion  “wire”. 

•  An  hypercube  of  processors  ,i.e,,  a  connection- machine  like  architecture 
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•  A  smaller  array  of  larger  processors  ,  e.g.,  2x2x2  processors. 

Computation  and  communication  time  are  computed  for  the  first  two 
cases  and  estimates  are  derived  for  the  third  case. 

The  main  qualitative  conclusion  is  that  large  simulations  (about  1M  par¬ 
ticle)  can  be  performed  on  any  of  the  above  architectures  in  reasonable  time. 
This  is  particularly  interesting  with  respect  to  the  connection  machine  -  an 
existing,  ready  to  be  used  system  -  but  also  noteworthy  with  respect  to  the 
array  of  processors  which  is  quite  competitive  in  performance. 
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2  A  Recursive  Formulation  of  the  N-body  Problem 

We  consider  N  point  particles  distributed  in  two  or  three  dimensional  space 
(2D  or  3D).  We  are  interested  in  the  forces  that  these  particles  exert  on  each 
other.  We  use  celestial  mechanics  terminology  (point  masses  and  Newtonian 
gravitational  fields)  although  the  method  is  equally  applicable  to  electrostat¬ 
ics  (electrical  charges  and  electrostatic  fields)  and  fluid  mechanics  (vortex 
particle  and  vortex  field)  [1]. 

The  force  that  a  particle  at  a  point  x,  x  6  .ft3,  exerts  on  a  particle  y, 
y  €  R3,  is  given  by 


fry  ~  Gr~, 


MXMV 
x-y  | 


where  lr  is  a  unit  vector  in  the  x  —  y  direction.  The  forces  are  additive,  i.e., 
the  total  force  on  any  particle  p ,  F„  is  the  sum  of  the  forces  due  to  all  other 
particles 

Fp  —  /w  (2) 

q£paTticle»,qytp 

An  additional  valuable  property  of  these  forces  is  the  symmetry,  fpq  = 
-f9P  .  It  is  also  convenient  to  express  the  forces  in  terms  of  a  potential 
function,  $,  such  that  the  force  on  a  unit  mass  at  i  is 

Fx  —  —grad  $  (3) 

The  tree  algorithm  proceeds  by  dividing  the  space  into  “computational 
boxes”.  The  method  is  illustrated  by  its  2D  version  which  is  simpler  to 
explain;  the  generalization  to  3D  is  straightforward. 

We  choose  a  (2D)  rectangle  such  that  all  the  particles  are  included  in  it. 
This  is  called  the  top  rectangle.  It  is  partitioned  into  four  equal  rectangles, 
called  the  children  of  the  top  rectangle.  Top  is  called  the  parent  of  the 
children  rectangles.  Each  child  is  partitioned  to  four  equal  rectangles.  This 
process  continues  to  an  arbitrary  level,  say  h.  For  compatibility  with  prior 
work  (11),  the  level  of  top  is  named  the  0  -  level,  so  the  bottom  level  becomes 
the  A-th  level. 

The  set  of  neighbors  of  a  computational  box  6  comprises  all  boxes  of 
the  same  size  whose  boundary  hats  at  leaist  one  point  in  common  with  the 


boundary  of  b.  Thus,  in  2D  a  box  has  at  most  eight  neighbors;  in  3D  a  box 
has  26  neighbors. 


Figure  1:  The  computational  box,  its  parent,  children  and  neighbors;  Aa,Ab, 
Ac, Ad  are  children  of  A;  A  is  a  parent  of  Aa,etc.;The  neighbors  of  Ad  are: 
Aa,Ab,  Ba,  Bb,Da,Cc,Ca,  and  Ac. 


It  is  convenient  to  partition  the  force,  or  the  potential,  into  the  far  field 
and  the  near  field  so  that  $p  =  $Pineo T field  +  The  reasons  for  this 

partition  are  twofold:  (1)  Approximation  formulas  valid  for  the  far  field  po¬ 
tential  are  not  valid  for  the  new  field  potential;  (2)  Considerations  related  to 
the  finite  computer  word  size  and  to  global  accuracy  necessitate  modification 
of  the  new  field  at  small  distances  (smoothing  [3]  [4]).  In  partitioning  to  far 
and  new  fields  we  follow  [7,9]  by  taking  the  field  at  computational  box  due 
to  particles  in  itself  and  in  its  neighbor  boxes  to  be  the  near-field  potential; 
the  field  in  the  box  due  to  all  boxes  which  are  not  neighbors  is  the  fw-field 
potential. 

Let  $n  be  the  far-field  potential  function  due  to  pwticles  in  the  box  n. 
For  any  x,  x  £  n  and  x  £  ntighbors(n),  $n(x)  is  the  contribution  of  pwticles 
in  n  to  the  potential  at  x. 

Let  'Pn  be  the  far-field  potential  in  n;  i.e.,  for  any  x,  x  6  n,  <Pn(x)  is  the 


kll 


contribution  of  all  particles  q,  q  £  n,  and  q  g  neighbors(n)  to  the  potential 
at  the  point  x. 

A  recursive  relation  among  and  'Pparen«(n)  is  the  heart  of  the  tree 
methods: 


(4) 


—  ^porent(n)  4"  ^  ■  $« 

Where  the  set  on  which  the  summation  of  $,  is  done,  is  called  the 
interaction  set  of  n  and  is  defined  by 


In  =  {j\j  2  children(neigkbors(parent(n))),  j  &  neighbors(n)}  (5) 
A  solution  of  4  is  defined  by  its  boundary  conditions 

*tov  =  0,  (6) 


and  by 


*n  =  E  *< 

iGchildren(n) 

where  n  is  not  a  bottom  level  rectangular;  for  a  bottom  level  n 


(?) 


*  =  5° 

*n  —  *  n) 


(8) 


where  is  the  far-field  potential  function  given  in  terms  of  the  particles 


in  n. 


The  Barnes-Hut  and  the  Greengard-Rokhlin  Meth¬ 
ods 


The  Barnes-Hut  (BH  in  the  sequel)  [6]  and  the  Greengard-Rokhlin  (GR)  [7,9] 
methods  both  solve  the  above  recursive  equations.  Their  differences  are  in 
how  the  potential  functions  axe  represented  and  approximated. 

For  presenting  the  material  in  this  section  it  is  useful  to  distinguish  be¬ 
tween  a  function,  say,  /,  its  approximation,  /  and  the  representation  of  this 
function  or  its  approximation,  rep(f)  or  rep(f).  Usually,  we  want  to  repre¬ 
sent  the  function  itself,  however,  for  various  reasons  we  carry  only  enough 
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information  to  define  the  approximation  completely.  Thus,  by  design,  rep(f) 
represents  /. 

In  the  BH  method  the  function  is  represented  by  two  quantities  (m,c) 
where  m  is  the  sum  of  the  masses  of  the  particles  in  the  box  n,  and  c  is  the 
center  of  mass  vector.  Thus,  at  a  point  y,  y  &  neighbors(i)  and  y  t,  is 
approximated  by  which  is  the  potential  of  a  mass  mate  evaluated  at  the 
point  y.  I.e. 


Kit'  =  G; 


m 


y-c 


(9) 


A 

Similarly,  ,  for  level  l,  l  ^  h,  approximates  $n.  It  is  calculated  by 
summing  the  masses  of  the  children  and  finding  a  center  of  mass  of  ■  he 
masses  of  the  children.  The  representation  of  $n  is 


«**.)-(  £  (10) 

ieckildre n(n)  2^iec/iiWren(n)  mt 

The  approximation  to  4/n,  is  derived  from  4;  is  represented  as  a  list 
of  pairs;  the  pairs  (m,,  cj)  of  all  the  members  of  interactive  set  are  appended 
to  form  a  list  which,  in  turn,  is  appended  with  a  similar  list  representation 

^porent(n)* 

The  actual  potential,  or  the  force  at  a  point  p,  is  calculated  by  evaluating 
the  appropriate  $n,  n  at  level  h,  by  evaluating  each  member  of  the  list  at  p, 
and  by  adding  the  near-field  effect  to  the  result. 

The  3D  version  of  the  GR  method  is  described  in  [9]  and  in  [11].  The 
much  simpler  2D  version  suffices  for  showing  how  the  algorithm  fits  the  above 
computational  structure. 

In  the  sequel  we  identify  R2  with  the  complex  plane  C  which  simplifies 
the  notations.  Let  n  be  a  box  on  some  level.  For  all  r,  z  £  n  and  z  0 
neighbor s(n),  the  GR  method  considers  as  its  multipole  expansion  around 
the  center  of  n 


$n(z)  =  Oo  log(z)  +  £  ~T-  (H) 

*=i  z 

For  all  z,  z  £  n,  the  GR  method  considers  'F,,  as  its  local  (Taylor)  expan¬ 
sion  around  the  center  of  n, 
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(12) 


#»(*)  =  £&/*'• 

1=0 

In  both  cases  the  functions  are  approximated  and  represented  by  their 
first  p  +  1  coefficients,  where  p  is  chosen  according  to  the  accuracy  required. 

To  derive  4>n  from  the  4>’s  of  its  children,  equation  7,  the  child  represen¬ 
tation  is  “translated”,  i.e.,  each  child  j  €  children(n)  is  expanded  to  a 
multipole  expansion  around  the  center  of  n.  Once  this  is  done,  the  $’s  of  all 
children  are  expanded  around  the  same  origin  and  $n  is  found  by  summing 
the  coefficients  of  the  translated 

To  derive  (equation  3)  the  t  €  interactive  set  of  n,  are  converted 
to  a  local  expansion  around  the  center  of  the  box  n.  '$parcnt(n)  is  in  this  local 
form  and,  therefore,  the  remaining  operation  is  to  translate  (shift)  the  local 
expansion  from  the  center  of  parent(n)  to  the  center  of  n.  Now,  all  these 
functions  are  represented  in  n  by  their  local  expansions  around  the  same 
point,  and  is  obtained  by  summing  up  the  coefficients  of  all  these  local 
expansions. 

Details  of  the  GR  algorithm  follow: 

Let  n  be  a  box  of  level  h  (bottom  level).  Let  the  center  of  the  box  be  the 
coordinate  system;  let  the  box  have  m  point  masses  {^,,  i  =  l,...,m}  located 
at  points  {z,,  i  =  1. ...,  m}  in  the  box;  let  r  be  the  diagonal  of  the  box  ,  thus, 
|z,|  <  r.  Then,  for  any  z  6  C  with  \z\  >  r, 


$«(*)  =  a0log(z)  +  £ 


where 

_  j  v~'  ~9*  {zi) 

a0  =  2_s  q,  and  a*  =  2^ - Z - • 

t=i  i=i  K 

is  approximated  by  the  first  p  terms  of  the  infinite  series.  It  can  be  shown 
that  for  any  p  >  1 , 


$n(z)  -  Oq  log(z) 

r 

r 

k= l  Z 

1  - 

z 

z 

5; 


Clearly,  is  represented  by  {  a*,  fc  =  0,  ...,p  }  and  the  coordinates  of 
the  center  of  the  n-th  box. 

To  find  i  not  of  level  h,  one  uses  equation  4.  To  sum  the  4>;,  j  € 
children(i),  the  expansion  representing  is  “translated”;  i.e.,  each  is 
expanded  to  a  multipole  expansion  around  the  center  of  t.  Once  this  is  done, 
the  $’s  of  all  children  are  expanded  around  the  same  origin  and  is  found 
by  summing  the  coefficients  of  the  translated 

Let  the  coefficients  di,i  =  0, ...,  oo,  specify  the  multipole  expansion  of 
around  the  point  zq.  The  translation  of  this  expansion  to  the  origin  is  given 
by 


$„(*)  =  *olog{z)  +  £  ~r  (14) 

i=i  2 

where 

<«> 

with  ^  the  binomial  coefficients. 

is  formed  by  retaining  the  first  p  +  1  terms  of  the  above  series.  It  is 
represented  by  {  =  0,  ...,p  }  and  the  coordinates  of  the  center  of  the 

n-th  box. 

'I'n  (equation  3)  is  computed  by  converting  the  $i,  i  €  interactive  set  of  n, 
to  a  local  expansion  around  the  center  of  the  box  n.  This  conversion  is  given 
by  the  following: 

Let  {  a/e,  k  =  0, ...  }  be  the  representation  of  some  j  is  not  a  neighbor  of 
n.  Let  n’s  center  be  the  origin,  and  let  the  center  of  j  be  zo .  The  contribution 
of  to  at  a  point  z  in  n  is 

(16) 

/=0 


co 

bo  =  aolog(-zo)  +  £ 

k=  1 


where 


(17) 
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The  shifting  of  'Jparen^n)  to  the  center  of  n  is  given  by  the  following 
equation:  For  any  zo ,  z  and  a*,  k  =  0,1,..., p, 

£<**(*  -  2o)*  =  H(£0*(:f)(-2o)*~V-  (19) 

k=0  i=0  k=l  W 

Several  functions  can  be  defined  to  summarize  the  GR  algorithm: 

Let  *+’  be  an  operator  that  maps  representations  of  the  same  type  which 
are  expansions  around  the  same  point  z  into  a  representation  around  z  each  of 
its  coefficient  is  the  sum  of  the  appropriate  coefficients  of  the  ‘+"s  arguments. 

Translattx  maps  a  representation  of  a  multipole  expansion  to  a  multipole 
expansion  around  the  center  of  box  z  (equation  14). 

Shi ftx  maps  a  representation  of  a  local  expansion  to  a  local  expansion 
around  the  center  of  box  z  (equation  19). 

Convertz  maps  a  representation  of  a  multipole  expansion  to  a  local  ex¬ 
pansion  around  the  center  of  box  z  (equation  16). 

Under  these  conventions  the  computation  process  by  which  the  GR  algo¬ 
rithm  computes  equations  4  and  7  can  be  written  in  the  following  way: 

rep(4n)  =  52  Translaten(rep($i))  (20) 

<€cMMren(n) 


rep(*n)  =  Shiftn(rep{4fpurent(n)))-\- 


Convertn(rep($i)) 


n«i«Mor«(p«r«ni(n))), 

}tnnghbor${n)} 

(21) 

The  operations  of  the  BH  algorithm  can  be  described  with  functions  bear¬ 
ing  the  same  names  with  one  minor  difference:  Translate  has  all  the  repre¬ 
sentations  of  the  children  of  z  as  arguments  and  the  summation  is  omitted. 
I.e.,  equation  (20)  becomes 

rep($n)  =  Tran3laten(rep($i),rep{$i),  ...,rcp(9i), ...)  all  t,  t  €  children(n) 

(22) 


4  The  Computational  Structure 

This  section  considers  a  network  of  processes,  each  associated  with  a  region  of 
space  (a  cell,  a  rectangular  for  2D,  a  box  for  3D).  Each  process  communicates 
with  its  parents,  its  children  and  its  neighbors  to  compute  the  solution  to  the 
N-body  problem  according  to  the  above  algorithm.  If  a  processor  is  assigned 
to  each  process  and  all  processors  operate  in  a  SIMD  style,  this  architecture 
is  “connection  machine  like”  [12]  in  the  sense  that  there  are  many  equal 
processors  operating  in  SIMD  style  which  communicate  via  many  connections 
to  “neighbors”.  The  difference  between  the  connection  machine  and  this 
network  is  that  this  network  is  not  a  hypercube;  instead,  its  connections  are 
derived  from  the  structure  of  the  equations.  Our  purpose  is  to  find  bounds 
on  the  computation  time  of  this  network  architecture  in  order  to  compare  it 
\<ith  the  connection  machine  and  other  architectures. 

A  look  at  the  equations  shows  us  the  following  communication  paths  for 
each  process  (cell)  C: 

a.  C  to  C’s  parents  (equation  8), 

b.  C’s  parents  to  C  (equation  4), 

c.  C  to  each  of  its  children  (equation  4), 

d.  Each  child  of  C  to  C  (equation  8), 

e.  C  to  C’s  interactive  set  (equation  4), 

f.  Each  member  of  C’s  interactive  set  to  C  (equation  4), 

h.  C  to  its  neighbors  ( h  level  only  for  the  calculation  of  the  near-field), 

i.  Each  neighbor  of  C  to  C  (h  level  only  for  the  calculation  of  the  near¬ 
field). 

If  we  abstain  from  sending  messages  in  both  direction  at  the  same  time 
(as  long  as  the  SIMD  discipline  is  maintained  this  does  not  occur)  a  and  b, 
c  and  d,  and  h  and  i  can  be  satisfied  by  the  same  “wire”.  Moreover,  the 
interactive  set  is  symmetric,  i.e.  for  any  box  a,  6,  a  G  interactive  set(b) 
implies  b  G  interactive  set(a).  Therefore,  e  and  f  can  also  be  united. 

Thus,  we  need  the  following  connections  at  each  process  C: 


I 
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a.  C  to/from  C’a  parents  (8), 

b.  C  to/from  each  of  its  children  (4), 

c.  C  to/from  C’s  interactive  set  (4), 

d.  C  to/from  its  neighbors  ( h  level  only  for  the  calculation  of  the  near- 
field). 

If  each  connection  is  implemented,  connection  machine  style,  by  one  wire, 
we  have  32  wires  in  2D  (1  parent,  4  children,  up  to  27  members  in  interactive 
set).  In  3D  we  have  198  wires  (1  parent,  8  children,  up  to  189  members  in 
interactive  set).  This  number  seems  high;  certainly  it  cannot  be  implemented 
as  one  physical  wire  per  connection.  However,  we  can  reduce  the  number  of 
connections  as  follows. 

The  information  sent  from  C  to  its  interactive  set  members  is  the  same 
as  that  sent  to  its  parents.  C’s  interactive  set  members  can  be  reached  by 
having  C’s  parents  forward  the  information  to  their  (26)  neighbors,  who, in 
turn,  forward  it  to  their  children.  Since  each  message  carries  its  source,  the 
children  can  check  if  the  sender,  C,  is  in  the  child’s  interaction  set  and  discard 
the  information  on  a  negative  answer. 

This  argument  replaces  connection  to  interactive  set  members  with  con¬ 
nections  to  neighbors.  The  number  of  connections  per  cell  becomes  13  for 
2D  and  35  (1  parent,  8  children,  26  neighbors)  for  3D,  which,  although  not 
low,  is  much  better.  In  such  an  arrangement  there  are  connections  to  par¬ 
ents,  to  children  and  to  neighbors  for  all  processors,  including  those  at  the  h 
level.  Figure  2  illustrates  the  connection  structure  in  2D,  the  equivalent  3D 
structure  is  somewhat  difficult  to  draw. 

Figure  3  shows  that  the  computation  structure  has  the  form  of  a  graph 
where  each  node  describes  a  cell  and  each  branch  describes  a  wire  connecting 
two  cells.  The  graph  has  a  form  of  a  pyramid  with  the  top  node  corresponding 
to  the  top  cell  and  the  bottom  nodes  to  the  h  level  cells.  Levels,  or  cells  of 
equal  size,  appear  distinctly  in  the  graph;  they  are  nodes  of  equal  height  from 
the  bottom  or  the  top  of  the  pyramid. 

Consider  an  algorithm  that  performs  the  far-field  computation  in  the 
following  steps: 


Figure  2:  A  2D  cell  (node)  and  its  connections 


Algorithm: 


1.  rep($°):  Compute  rep($°),  for  all  n,  n  in  level  h.  (The  exact  represen¬ 
tation  and  method  of  computation  depends  on  the  method  used  and 
will  be  elaborated  on  in  the  sequel.) 


2.  Calculate  rep($„)  going  up  the  tree:  Given  rep($°),  and  using  equa¬ 
tions  8  and  20  compute  rep($n)  for  all  n,  level  by  level,  starting  in  level 
h  —  1  and  ending  in  level  0. 


3.  For  ail  nodes  of  the  tree  calculate  the  effect  of  the  interactive  set  mem¬ 
bers:  For  all  direction  d  in  a  level  do 


(a)  Each  node  n  sends  to  its  interactive  set  member  that  lies  in  direc¬ 
tion  d  from  it  the  rep($n)  for  all  p,  p  a  child  of  n; 

(b)  Each  node  n  receiving  such  an  message  plays  the  role  of  a  neighbor 
and  retransmits  it  to  all  its  children; 

(c)  Each  node  n  (playing  the  role  of  a  child)  computes  Convert,(rep(^,)) 
for  all  messages  received  from  one  of  its  interactive  set  members 
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Figure  3:  The  computation  “tree”. 

and  discards  the  rest  of  the  messages.  The  result  is  accumulated, 
awaiting  the  arrival  of  rep('Pi)  from  n’s  parent. 

4.  rep( >Ifn):  The  values  of  rep(4'n)  are  propagated  down  the  tree  to  com¬ 
pute  rep(tyn)  for  all  children  according  to  equation  21. 

5.  Evaluation:  For  all  n  in  level  h,  rep(4rn)  is  used  to  calculate  the  far-field 
potential  at  each  mass-point  in  the  cell. 

Note  that  all  the  h  level  nodes  can  perform  step  1  simultaneously.  Simi¬ 
larly,  step  2  is  simultaneously  performed  by  all  nodes  of  the  same  level.  Step 
3  is  simultaneously  performed  in  parallel  by  all  nodes  of  the  entire  network. 
This  holds  if  we  understand  that  a  node  without  children  does  not  send  them 
messages,  etc. 

Some  additional  consideration  of  the  algorithm  shows  that  our  tree  con¬ 
sists  of  three  kinds  of  nodes:  the  top  node,  the  intermediate  nodes,  and  the 
bottom  level  nodes.  Figure  4  illustrates  the  functions  computed  and  the 
information  flow  in  each  node. 

The  above  algorithm  differs  from  the  algorithms  described  in  [9]  and 
in  [11]  in  several  aspects.  First,  the  algorithm  is  really  a  frame  in  which 


Figure  4:  Computation  and  Information  transfer  in  tree  nodes. 


different  functions  may  be  substituted  for  $n,  Translate;,  etc.  The  algo¬ 
rithm  is  described  as  a  “tree”  of  processes  operating  in  parallel  rather  than 
a  single  process  algorithm  as  in  [9],  or  the  level-by-level  parallel  algorithm 
as  in  [ll](page  50).  More  importantly,  this  algorithm  handles  the  informa¬ 
tion  transfer  explicitly  and  so  reduces  both  communication  and  computation 
times.  The  main  tool  for  accomplishing  this  is  the  parallel  structure.  Infor¬ 
mation  is  sent  in  parallel  so  that  messages  do  not  conflict  and  computing  is 
done  in  parallel  by  all  processes  of  the  tree. 

The  network  of  processes  can  operate  also  in  the  “data  flow”  mode  where 
each  process  performs  the  computation  once  the  data  is  available.  The  op¬ 
eration  of  the  network  in  this  mode  has  not  been  fully  investigated  as  yet. 
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5  Communication  Time 

The  time  required  to  solve  the  N-body  problem  by  the  above  computation 
structure  consists  of  the  time  that  each  cell  spends  computing  plus  the  time 
required  to  transfer  the  information  from  process  to  process.  The  later  com¬ 
ponent  is  named  communication  time.  The  communication  time  is  bounded 
under  several  assumptions. 

1.  Time  is  sampled  and  divided  into  intervals;  during  one  time-interval  a 
process  can  both  send  one  message  and  receive  one  message. 

2.  Messages  are  of  bounded  length. 

3.  The  time  to  send  a  message  is  one  “message  time”  unit. 

When  the  messages  Me  small,  one  can  add  the  assumption  that  processes 
can  combine  several  messages  into  one  message.  It  turns  out  that  in  our 
applications  sizes  are  such  that  this  assumption  should  be  be  omitted. 

It  is  further  assumed  that  each  function  representation  is  transferred  by 
one  message  (this  assumption  will  be  modified  later  on).  Under  the  above 
assumptions  we  get  the  following  communication  time  bound  (Notice  that 
under  the  SIMD  assumptions  all  processes  Me  doing  the  same  operation  at 
the  same  time.) 

Consider  the  algorithm  for  fM-field  computation  in  the  previous  section. 
Let  us  assume  that  in  2D  there  are  n  x  n  regions  (n  x  n  x  n  regions  for  3D) 
with  h  =  flog  n] .  Denote  by  n children  the  number  of  children  of  a  processor 
(  processor  level  is  not  h).  Denote  by  n„  the  maximum  number  of  neighbors 
of  a  processor.  Denote  by  niet  the  maximum  number  of  members  in  an 
interactive  set.  The  following  lists  the  communication  time  in  message-time 
units  for  each  step  of  the  algorithm: 

Step  2  Since  each  processor  can  send  one  one  message  and  receive  one  message 
at  a  time,  the  time  is  nChiidren  x  h. 

Step  3  Here  each  processor  sends  nchudren  messages  to  each  neighbor  which,  in 
turn,  sends  one  copy  to  each  of  its  children.  The  time  is  nn(nchudren  + 
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Children)-  Note  that  messages  can  be  sent  and  received  simultane¬ 
ously  by  till  processors  without  interference  provided  that  each  proces¬ 
sor  sends  its  message  in  the  same  “direction”;  i.e.  all  processors  send 
to  the  left  and  receive  from  right,  etc. 


Step  4  Here  each  processor  sends  a  message  to  each  of  its  children.  The  time 

Is  Tt children  k. 


Thus,  the  total  communication  time  in  message  time  units  is: 
communication  time  <  2h.nchiidren  +n„x  (ne/,jwre„  +  n  children)'  (23) 


In  terms  of  actual  time  what  is  “message  time”? 

The  GR  algorithm  specifies  a  function  by  p  coefficients  and  by  the  coor¬ 
dinates  of  the  center  of  the  box  around  which  the  function  is  expanded  (the 
source  of  the  message).  Since  messages  are  function  descriptions,  all  mes¬ 
sages  are  of  equal  length  and  the  concept  of  message  time  is  well  defined  as 
the  time  required  to  send  a  function  description.  For  example  if  p  =  10  and 
each  coefficient  and  the  source  are  represented  by  32  bits,  the  length  of  the 
message  is  about  352  bits.  The  connection  machine  messages,  for  example, 
are  190  bits  long;  therefore,  on  that  machine,  two  actual  messages  are  needed 
to  implement  our  352bit  message  and  the  “message  time”  becomes  the  time 
for  two  connection  machine  messages. 

The  case  of  the  BH  messages  is  somewhat  more  complicated.  Fixed  size 
messages,  each  consisting  of  ( center  of  mass,  mass,  source)  travel  up  the 
tree.  Therefore,  the  contribution  of  steps  2  and  3  of  the  algorithm  is  as 
above,  totaling 


fl.n children  "b  children  "b  ^-children)- 


(24) 


However,  on  the  way  down,  the  number  of  such  triplets  that  pass  each 
node  of  the  tree  equals  the  number  of  interactive  set  members  of  all  the 
node  ancestors.  Moreover,  this  function  description  is  not  a  message  of  fixed 
length  and  the  estimation  of  communication  time  has  to  be  done  differently. 

The  number  of  triplets  that  a  node  at  the  bottom  of  the  tree  (pyramid) 
receives  is  (less  then) 


nset  x  h. 


(25) 


With  k  triplets  per  message,  the  communication  time  is  the  above  expression 
over  k. 

The  following  argument  supports  the  above  expression: 

Choose  an  arbitrary  node,  say  A.  A  sends  down  the  tree  the  triplets  of 
its  interactive  set.  As  soon  as  the  last  triplet  clears  A’s  child,  the  child  can 
send  its  own  interactive  set  triplet  down  the  tree.  Thus,  for  a  tree  of  height 
h  and  a  bound  on  the  number  of  members  in  an  interactive  set,  we  get  the 
above  expression  as  a  bound  on  the  communication  time. 


6  Computation  Time 

This  section  estimates  the  computation  time  resulting  from  the  arithmetic 
operation  only.  It  does  not  include  the  communication  time  and  it  does  not 
include  data  transfers  inside  processors.  The  estimation  is  made  for  both  2D 
and  3D  using  both  GR  and  BH  algorithms. 

The  time  to  perform  floating  point  addition,  multiplication,  etc.  is  de¬ 
noted  by  4,  4  etc.  4,  etc.  denotes  the  corresponding  time  to  perform  the 
corresponding  complex  arithmetic.  It  is  assumed  that  t+  =  2 4>  and  that 
t.  =  4  U  +  2t+. 

6.1  GR  Algorithm  in  2D 

It  is  assumed  that  we  maintain  p  +  1  terms  in  each  expansion. 

6.1.1  $° 

This  calculation  follows  Theorem  2.1.1  in  [9]. 

It  is  assumed  that  in  each  ceil  there  are  m  mass  points  q,  at  points  z,,i  = 
l...m. 

1.  The  calculation  of  z*,  i  =  1,  ...m,  k  =  1,  ...,p,  requires  m(p  —  1)4 

2.  The  calculation  of  a*  requires: 

For  k  =  0  the  time  is  (m  - 1)4;  For  k  ^  0  the  time  is  (m— 1)4  +  2m4 

3.  Summing  the  above  and  replacing  the  complex  arithmetic  entries  by 
their  equivalent  floating  point  arithmetic: 

(m— l)4+2(m— l)p4+m(3p— l)f»)  =  (8mp— 2p-m— l)4-|-m(12p— 4)t. 

(27) 

6.1.2  Translate^ 

This  calculation  follows  Lemma  2.2.1  in  [9],  equation  15  above.  Notice  that 
at  each  level  there  are  only  four  different  values  for  z0.  These  differ  from 
each  other  by  multiplication  by  j  (the  complex  rotation  by  ^).  As  one  goes 
up  a  level  the  corresponding  zq  is  multiplied  by  2.  Thus,  the  values  of  Zq, 


/  =  l,...,p  can  be  assumed  to  be  precalculated.,  and  15  takes  the  following 
format: 


b,  =  - 


l 


+  Ea*0m2A 

k= 1 


*o) 


(28) 


where  level  denotes  the  cell  level,  and  m  =  0, 1,2,3. 
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1.  The  calculation  of  6/  requires: 

For  /  =  0  —  no  operations  required;  For  /  ^  0  the  time  is  3 It,  +  lt+. 

2.  Summing  the  above  and  replacing  the  complex  arithmetic  entries  by 
their  equivalent  floating  point  arithmetic: 

3P— ^  Pt.  +  =  4 (p2  +  p)t+  +  6(p2  +  3 p)t..  (29) 

6.1.3  Convert,, 

This  calculation  follows  Lemma  2.2.2  in  [9],  equation  18  above.  Notice  that 
here  ,  as  in  the  above  section,  at  each  level  there  are  only  se/eral  different 
values  for  z0;  each  corresponds  to  the  position  of  a  cell  relative  to  each  of  its 
interactive  set  members.  Here  too,  as  one  goes  up  a  level  the  corresponding 
z0  is  multiplied  by  2.  Thus,  the  values  of  zl0,  l  =  1,  ...,p  can  be  assumed  to  be 
precalculated;  the  number  of  sets  of  numbers  equals  the  maximum  number 
of  members  in  am  interactive  set.  With  zl0  assumed  known  constants,  and 
taking  into  account  the  multiplications  by  powers  of  2,  we  get  the  following 
operation  counts: 

1.  The  calculation  of  6/  requires: 

For  1  =  0  —  (2  +  2 p)U  +  pt+.  For  1^0  —  (2  +  2 p)f.  +  pt+. 

2.  Summing  the  above  and  replacing  the  complex  arithmetic  entries  by 
their  equivalent  floating  point  arithmetic: 

(2+4p+p2)t.  +  (p+l)p4  =  4(2+4p+p2)t.  +  (2.(2+4p+p2)  +  (p+l)p)4 

(30) 

=  (8  +  16p  +  4p2)f.  +  (4  +  9p  +  3p2)t+  (31) 
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6.1.4  Shift, 


This  calculation  follows  Lemma  2.2.3  in  [9],  equation  19  above.  As  in  the 
Translate,  case,  at  each  level  there  are  only  four  different  values  for  z0 
differing  from  each  other  by  multiplication  by  j  (the  complex  rotation  by  ~). 
As  one  goes  up  a  level  the  corresponding  z0  is  multiplied  by  2.  Thus,  again, 
the  values  of  z‘0,  l  =  l,...,p  can  be  assumed  to  be  precalculated. . 

From  19  follows  that 


1.  The  calculation  of  c;  requires: 

3(p-/)t‘.  +  (p-/)4.  (33) 

2.  Summing  the  above  and  replacing  the  complex  arithmetic  entries  by 
their  equivalent  floating  point  arithmetic: 

!£±i>?(3i-.  +  4)=(£±^(12(.  +  8i+)  (34) 

6.2  GR  Algorithm  in  3D 

The  following  calculations  follow  the  3D  expansion  by  F.  Zhao  [11].  Zhao 
expands  the  various  functions  to  a  series  of  the  derivatives  of  4;  where  R  is 
the  distance  from  the  center  of  a  cell  (whose  function  is  being  computed)  to 
the  point  at  which  the  potential  is  desired.  To  avoid  evaluation  of  (direc¬ 
tion)  cosines,  they  have  been  expressed  in  terms  of  the  cartesian  coordinates 
(xi,X2,£3).  In  the  sequel,  p  is  the  order  of  the  highest  derivative  retained. 
The  number  of  terms  in  the  expansion,  denoted  by  np,  is  derived  from  p. 

6.2.1 

This  calculation  follows  Theorem  3.2.1  in  [11]. 

It  is  assumed  that  in  each  cell  there  are  m  mass  points  q ,  at  points 
(xi.,x2,,x3,),  i  =  1  ...m. 

avk  = 

(=1  l.J.K. 
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(35) 


(36) 


If  we  retain  only  terms  of  t  4-  j  +  k  <  p  then  the  number  of  terms  is 

»„  =  E  E  'e  ’  i »  k+lM+vk+li  (36) 

k=Oj=0  1=0  0 

which  results  in  =  4,  —  10,  03  =  20,  n4  =  35,  etc. 

1.  Consider  the  following  recursive  relations  between  the  coefficients: 

<*0,0,0  =  1  (37) 

(-1)*.-  _  .  (-1)*;  .  (-!)■**  /?ox 

ai,j,k  -  Qt—ijjt - : -  -  ai,j-l,k - : -  ai,i,k- 1 - r —  (00) 

l  J  K 

2.  There  are  np  recursive  steps  each  corresponding  to  a  different  a,j*. 

3.  Each  recursive  step  requires  3 1,  (*  is  stored  as  7). 

4.  Tne  above  is  repeated  for  m  particles  and  accumulated,  giving  the 
result  (m  —  l)t+. 

5.  Summing  all  up  results  in  the  following  bound 


np(3mf .  +  (m  -  1  )t+) 


6.2.2  Translate, 


This  calculation  follows  from  Lemma  3.2.1  in  [11].  The  coefficients  of  the 
translated  expansion,  c,j*,  are 

Cijk  =  EEE  a°0ia',-a<]-0,k-y  (4°) 

a=O0=O 7=0 

Here  a'^  are  the  coefficients  in  the  expansion  for  1  JR,  measured  from  the 
old  cell  center  with  respect  to  the  new  center. 

1 .  The  a'jk  are  constants  that  can  be  precomputed  The  number  of  differ¬ 
ent  sets  of  such  constants  equals  the  number  of  children  of  a  cell  —  8 
in  3D.  The  constants  at  different  levels  can  be  derived  by  multiplying 
by  appropriate  powers  of  2. 


2.  The  number  of  terms  in  c,j*  is  (»'  4-  1  )(j  +  l)(Ar  +  1). 

3.  Each  term  is  a  multiplication  of  an  a  , a  power  of  2  and  an  a^*;  therefore 
each  term  requires  2t.  +  t+.  Thus,  each  Cijk  requires  (i  +  \){j  +  l)(jfe  + 

l)(2<.  +  4). 

4.  Summing  all  up,  t  +  4-  k  <  p,  results  in 

£  £  £  (*  +  1)0  +  m  4-  l)(2t.  +  4)  =  (41) 

k=0  j=0  i=0 

y£(p6+21p5+175p4+735p3+1624p2+1764/H-720)(2t.+t+)  p(2t.+t. 

(42) 

The  above  expression  (caculated  by  F.  Zhao  using  MAXIMA  [19])is  denoted 
by  kp.  Thus,  the  time  is 

M2*.  +  4)-  (43) 

6.2.3  Convertt 

This  calculation  follows  from  Lemma  3.2.2  in  [11].  The  coefficients  of  the 
converted  expansion,  c^are 

1  00 

=  juili  £  aa0yk+c'j+0,k+y{i  4  a)!(j  4-  0)\{k  4-  7)!  (44) 

bijk  are  the  coefficients  of  the  local  expansions  for  1/72,  measured  from  the 
old  cell  center,  with  respect  to  the  new  center. 

1.  The  bijk  are  constants  that  can  be  precomputed.  The  number  of  dif¬ 
ferent  sets  of  such  constants  in  the  number  of  interactive  set  members 
of  a  cell  —  189  in  3D.  The  constants  of  different  levels  can  be  derived 
by  multiplying  by  appropriate  powers  of  2. 

2.  Each  term  of  c,;fc  requires  at  most  (2f.  4-  4).  (There  are  three  items; 
dap-y,  a  power  of  2, and  a  constant.) 


I. 


3.  To  calculate  the  contribution  of  all  terms  note  that  both  atJ*  and 
are  zero  if  one  of  their  coefficients  is  larger  then  p.  Thus,  the  amount 
of  time  is 

p  p-kp-k-j  p  p-ap-a-0 

EE  E  EE  E  P'-+4)=  (45) 

k=0  i—0  t=0  or =Jfe  0=j  1=1 

=  ^(3 p*  +  22 p3  +  57p2  +  62p  +  24)(2t.  +  t+)  ^  c„(2t.  +  t+)  (46) 

A  crude  upper  bound  can  be  derived  simply  in  the  following  way;  If  the 
vanishing  of  the  6,j*  is  not  accounted  for,  the  estimate  for  the  number  of 
terms  of  c^*  is  np;  Since  the  number  of  c^t’s  is  np,  the  bound  on  time  is 
np(2t.  +  t+).  The  ratio  between  the  bound  of  equation  46  and  this  last 
bound  is  about  p/np;  This  illustrate  the  advantage  in  the  result  of  equation 
46. 

6.2.4  Shiftz 

This  calculation  follows  from  Lemma  3.2.3  in  [11].  The  coefficients  of  the 
converted  expansion,  d.^, are 

d*]k=  £  c*+aj+0Ml  ^  ^  7)  ^X?Axf  Ax^.  (47) 

Here  (Axi,  Ax2,  AX3)  aue  the  coordinates  of  the  old  center  in  terms  of  the 
new  center. 

1.  The  Ax,-  are  constants  that  can  be  precomputed.  The  number  of  dif¬ 
ferent  sets  of  such  constants  is  the  number  of  children  of  a  cell  —  8  in 
3D.  The  constants  of  different  levels  can  be  derived  by  multiplying  by 
appropriate  powers  of  2. 

2.  Each  term  of  dijk  requires  at  most  (3f.  +  t+)  (there  are  four  terms:  c,}k, 
a  power  of  2,  the  combinations  and  the  A’s). 


ws 


ww 
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3.  Summing  over  all  terms  and  all  coefficients  results  in  the  following 
bound 

p  p-kp-k-j  p  p-ap-a-0 

Y1  H  12  IT  51  (3 t.  +  t+) 

k= 0  J= 0  «=0  o>=*  /J=;  -ysri 


(48) 


1 


=  —  (3p*  +  22p3  +  57p2  4-  63p  +  24)(3t*  +  t+)  —  Cp(3t,  +  t+)  (49) 
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6.2.5  Evaluation 


The  evaluation  of  the  potential  at  each  mass  point  requires  the  evalu¬ 
ation  of  <&„,  n  at  level  h,  using  equation 


*J,n(^)  —  /*  (  ^ijkX\X\x 3, 

>.J,k=0 


for  the  potential.  The  force  can  be  obtained  by  taking  the  partial 
derivatives;  Thus, 


p 

I 

i,i,k=0 


fc,(x)  —  dijkixl  £2X3. 


<  — 1  ~J  —k 


The  potential  can  be  evaluated  with  two  multiplications  and  one  addi¬ 
tion  per  term  (keep  x\x2x*  for  each  ijk  and  proceed  by  increasing  the 
indices  one  at  a  time).  For  m  mass  points  in  z ,  the  evaluation  of  the 
potential  requires 

mnp(t+  +  2t.).  (50) 


For  m  mass  points  in  z,  the  evaluation  of  the  force  requires  about 

m(np{t+ +2t.) +  3npta).  (51) 


The  above  was  derived  by  assuming  that  first  the  members  of  the  series 
were  calculated  with  a  power  smaller  by  one  for  each  x,-,  and,  next,  for 
each  component  of  the  force,  the  series  components  where  multiplied 
by  the  omitted  components.  (I.e.,  multiplying  by  1X2X3  for  Fx  1  and 
ignoring  the  operations  for  forming  1x2X3.) 
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6.3  BH  Algorithm  in  3D 


The  2D  BH  algorithm  is  similar  to  the  3D  version  ;  its  detailed  opera¬ 
tion  count  is  omitted. 

Only  the  calculation  of  $5  ^d  Translate ,,  as  well  as  the  final  evalu¬ 
ation,  have  significant  operation  counts;  Convert,  and  Shiftz  amount 
to  concatenation  of  lists. 


6.3.1 

The  following  time  is  obtain  from  equation  10  (in  3D): 

(m  -  1  )t+  +3mt,  -I-  3(m  —  1)4  +3t/  =  3mt.  +4(m  —  1)4  +  3 1/.  (52) 

6.3.2  Translatez 

Translation  is  the  above  calculation  with  m  =  4: 

12t.  -I-  12t+  +  3 1,.  (53)  . 


6.3.3  Evaluation 


The  evaluation  of  the  potential,  or  force,  at  each  mass  point  requires  the 
evaluation  of  each  member  of  the  list  representing  n  at  level  k,  us¬ 
ing  equation  ??  for  the  potential  and  equation  1  for  the  force.  ,  n  at 
level  h ,  has  at  most  (h— 2)*(maximum  number  of  members  in  an  interactive 
elements.  That  maximum  is  189  in  3D. 

The  force  between  two  points  (equation  1  )  is  evaluated  according  to 


F:,  =  G- 


Af,  A x,  z. 


x-y 


x-y 


(54) 


and  requires  (3t.  -I-  24  +  3t_)  +  15t.  +  3t_  +  3t/  where  the  15  is  an 
estimate  of  the  time  for  evaluating  a  square  root.  With  t_  =  4  the 
result  is:  18t.  +  84  +  3f/. 


Thus,  for  m  elements  and  h  levels,  the  time  is: 


i 

I* 

$ 

| 

l 


189.m.(A  -  2)(18t.  +  84  +  3 t,  +  1) 


7  A  Comparison 


In  order  to  gain  insight  into  the  above  results  this  section  maps  the  pyramid 
of  processes  to  three  different  processor  configurations: 

•  A  pyramid  of  processors  corresponding  to  the  processes  pyramid  graph, 
where  each  vertex  is  assigned  a  process  and  each  arc  a  communication 
“wire”. 

•  An  hypercube  of  processors  ,i.e.,  a  connection- machine  like  architecture 

[12]. 

•  A  smaller  array  of  larger  processors  ,  e.g.,  2x2x2  processors. 

The  performance  of  the  BH  and  the  GR  algorithms  are  compared  by  calcu¬ 
lating  their  communication  and  computation  times  in  these  configurations 
The  calculation  is  done  for  several  typical  values  of  N ,  the  total  number  of 
particles,  m  number  of  particles  in  a  cell,  p  and  np  that  determine  the  number 
of  coefficients  that  the  expansion  carries.  To  get  some  idea  of  actual  time 
in  seconds  the  speed  of  communication  and  floating  point  operations  of  the 
CM2  model  connection  machine  is  used. 

7.1  The  BH  Algorithm  -  Comparative  Speed 

The  section  starts  by  summarizing  results  from  previous  sections. 

messages:  It  is  assumed  that  the  BH  algorithm  triplet  consists  of  4  single 
precision  32bit  word  plus  one  32bit  word  for  specifying  the  origin  of  the  cell. 
Thus,  the  160  bit  message  fits  into  the  190  bit  connection  machine  message; 
message-time  unit  becomes  the  time  for  one  message  on  the  CM2  connection 
machine  which  is  about  0.250  millisecond. 

Communication  time:  In  3D  we  have  8  children,  26  neighbors,  and  189 
as  the  maximum  size  of  am  interactive  set.  Thus,  the  communication  time  is 
197A  +  1872  (equation  26). 

Computation  time:  To  simplify  the  discussion  it  is  assumed  t+  =  t,; 
t/  —  5t+.  Under  these  simplifying  assumptions  the  time  spent  in  computing 


2.  Translate t:  39 ht+‘, 

3.  Evaluation:  189m(A  —  2)(41)4- 


Cells  and  particles:  Let  the  number  of  cells  in  the  bottom  level, A,  be 
N  =  n  x  n  x  n  ;  then  h  ~  flog2  n] .  The  total  number  of  cells  is: 

Wd  +  i  +  i +  ...)<  (56) 

Since  the  number  of  processors  in  the  connection  machine  is  64 K,  we 
would  like  to  restrict  the  number  of  cells  to  this  number. 

The  number  of  particles  is  mN. 

Summary:  The  following  table  summarizes  communication  and  compu¬ 
tation  times  for  selected  values  of  n  and  m.  The  communication  time  is  in 
message-time  units  and  the  computation  time  is  in  multiples  of  t+. 


n 

m 

particles 

h 

communication 

translatez 

n 

evaluation 

message  units 

4 

4 

4 

32 

1 

32  K 

5 

2857 

195 

18 

23247 

32 

10 

320 K 

5 

2857 

195 

81 

232470 

32 

20 

640 K 

5 

2857 

195 

151 

464940 

38 

1 

62712 

6 

3054 

236 

18 

30996 

38 

10 

627120 

6 

3054 

236 

81 

309960 

38 

20 

1250/tT 

6 

3054 

236 

151 

619920 

To  get  some  insight  into  these  numbers  we  recast  the  above  table  using 
20Kflops  per  second  for  t+  (CM2  model  connection  machine  measured  time 
[14]),  and  250  microseconds  per  message  (CM2  model  connection  machine 
message  time  [15]).  Time  in  the  table  is  expressed  in  milliseconds  and  is 
rounded. 


n 

m 

particles 

h 

communication 

millisec 

translatez 

evaluation 

millisec 

32 

1 

32 1< 

5 

714 

-  — 

1150 

32 

10 

320 K 

5 

714 

—  — 

11500 

32 

20 

640 K 

5 

714 

-  — 

23250 

38 

1 

62712 

6 

763.5 

-  — 

1550 

38 

10 

627120 

6 

763.5 

-  - 

15500 

38 

20 

1250# 

6 

763.5 

—  — 

31000 

The  table  illustrates 

i  several  characteristics  of  this  computation  scheme; 

.1 


•«1  ’ '4# _ 


1.  While  the  communication  time  is  substantial,  it  does  not  increase  with 
N  so  long  as  the  height  of  the  tree  is  constant.  Neither,  in  fact,  do 
any  of  the  quantities  associated  with  the  “tree”.  These  quantities  are 
a  function  of  the  size  of  the  tree,  i.e.,  of  n  and  h,  but  not  the  number 
of  particles. 

2.  With  a  low  number  of  particles  most  of  the  time  is  communication 
time. 

3.  Since  the  number  of  processors  in  the  connection  machine  is  fixed,  to 
reach  a  high  number  of  particles  m  has  to  increase.  This  shifts  the 
computation  load  to  the  evaluation  and  to  the  bottom  level  ( h  level)  of 
processors.  They  do  most  of  the  work  while  the  rest  of  the  processors 
are  idle. 

4.  The  evaluation  time  could  be  reduced  significantly  by  increasing  the 
number  of  processors  so  that  m  remains  1.  This  cannot  be  done  in  the 
connection  machine  that  has  a  fixed  number  of  processors. 

7.2  The  GR  Algorithm  -  Comparative  Speed 

The  section  starts  by  summarizing  results  from  previous  sections. 

messages:  It  is  assumed  that  the  GR  algorithm  uses  np  coefficients;  each 
coefficient  is  represented  by  a  single  precision  32bit  word  plus  one  32bit  word 
for  specifying  the  origin  of  the  cell.  Thus,  the  length  of  a  message  is  32np 
bits.  A  message-time  unit  is  the  time  for  one  such  message;  the  exact  time 
on  any  particular  hardware, say,  the  connection  machine,  depends  on  np. 

Communication  time:  In  3D  we  have  8  children  and  189  as  the  maxi¬ 
mum  size  of  an  interactive  set.  Thus,  the  communication  time  is  16 h  +  26.72 
(equation  24). 

Computation  time:  To  simplify  the  discussion  it  is  assumed  t+  =  f.; 
t/  =  5f+.  Under  these  simplifying  assumptions  the  time  spent  in  computing 


1.  $n°: 


np(4m  -  1  )f+; 


I 


2.  Translate,: 


3.  Convert 


4.  Shift,: 


5.  Evaluation:  Potential: 


Force: 


3  kpt+\ 


3cp4; 


4cp4; 


3  mnpt+] 


6  mnpt+. 


Denote  by  nn  the  number  of  neighbors  and  by  nch,idri.n  the  number  of 
children.  The  total  time  is: 


T{®n°)+h.T(Translatet)+(nn).nctoi<iren-T(Convert,)+h.T(Shiftz)+T(Evaluation) 

(57) 

(np(4m  -  1)  +  3 hkp  +  3 nnncMirenCv  +  Ahcj,  +  3 mnp)4  (58) 

(np(7m  -  1)  +  (3fcp  4-  4cp)/i  +  3nnnCAjMrenCp)4  (59) 

For  reference  in  the  sequel  it  is  convenient  to  denote 

P  =  np(7m  -  1)4  (60) 

which  is  the  time  spent  in  the  leaves  of  the  tree,  and  to  denote 

Q  ((3Ap  +  4Cp)/l  +  3nnncflildrenc?)t+  (61) 

which  is  the  time  spent  in  the  tree  nodes  that  are  not  leaves. 

Cells  and  particles:  Let  the  number  of  cells  in  the  bottom  level, A,  be 
N  =  n  x  n  x  n  ;  then  h  =  flog2  n] .  The  total  number  of  cells  is: 

'v(1  +  i  +  54  +  "')-^  <62> 

Since  the  number  of  processors  in  the  connection  machine  is  64 K  we 
would  like  to  restrict  the  number  of  cells  to  this  number. 
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The  total  number  of  particles  is  mN. 

Summary;  The  following  table  summarizes  communication  and  compu¬ 
tation  times  for  selected  values  of  n,  m  and  p.  The  computation  time  is 
stated  in  terms  of  t+.  The  communication  time  is  stated  in  terms  of  message 
units. 

The  time  for  Translate-!-  Convert+  Shift  is  denoted  by  Q.  This  time  is 
independent  of  m.  In  the  table  this  time  is  separated  from  the  time  which 
depends  on  m  and  which  represents  operations  done  by  the  bottom  level 
processes. 


n 

m 

particles 

h 

P 

np 

com¬ 

Q 

$n°  +  evaluation 

munication 

message  units 

4 

4 

32 

1 

32  # 

5 

1 

4 

1952 

4571 

24 

32 

1 

32  # 

5 

2 

10 

1952 

15732 

60 

32 

1 

32  # 

5 

3 

20 

1952 

42685 

120 

32 

10 

320 # 

5 

1 

4 

1952 

4571 

276 

32 

10 

32  OK 

5 

2 

10 

1952 

15732 

690 

32 

10 

32  0# 

5 

3 

20 

1952 

42685 

1380 

32 

20 

640 K 

5 

1 

4 

1952 

4571 

556 

32 

20 

640 # 

5 

2 

10 

1952 

15732 

1390 

32 

20 

640# 

5 

3 

20 

1952 

42685 

2780 

38 

1 

62712 

6 

1 

4 

1968 

4620 

24 

38 

1 

62712 

6 

2 

10 

1968 

15912 

60 

38 

1 

62712 

6 

3 

20 

1968 

43188 

120 

38 

10 

627120 

6 

1 

4 

1968 

4620 

276 

38 

10 

627120 

6 

2 

10 

1968 

15912 

690 

38 

10 

627120 

6 

3 

20 

1968 

43188 

1380 

38 

20 

1250# 

6 

1 

4 

1968 

4620 

556 

38 

20 

1250# 

6 

2 

10 

1968 

15912 

1390 

38 

20 

1250# 

6 

3 

20 

1968 

43188 

2780 

The  above  table  is  translated  below  to  “real”  numbers:  the  communica¬ 
tion  time  assumes  a  190bit  message  and  coefficients  of  32bit;  the  computation 
time  assumes  20Kflops  per  second,  typical  of  the  CM2  model  connection  ma¬ 
chine  [14].  The  communication  time  is  assumed  to  be  250  microseconds  per 
message  [15]. 
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3 

20 
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This  table  illustrates  several  characteristics  of  this  computation  scheme: 

1.  As  in  the  BH  case,  the  communication  time  does  not  increase  with  m. 
Neither  do,  in  fact,  any  of  the  quantities  associated  with  the  “tree”. 
These  quantities  are  a  function  of  the  size  of  the  tree,  i.e.,  n  and  h,  but 
not  of  the  number  of  particles. 

2.  Unlike  the  BH  case,  most  of  the  time  is  spent  in  the  “body”  of  the  tree 
performing  Convert  -  calculating  the  effect  of  the  interactive  set.  The 
evaluation  stage  is  relatively  fast. 

3.  The  GR  algorithm  seems  to  be  several  times  faster  than  the  BH  algo¬ 
rithm  for  a  large  number  of  particles  arranged  as  indicated  below.  It 
requires  less  communication  time  and  less  computing  time.  For  a  small 
number  of  particles  (say,  32K  particles)  the  GR  algorithm  with  p  =  1  is 


faster  then  the  BH  algorithm.  An  accurate  comparison  requires  com¬ 
paring  performance  for  the  same  global  error.  Such  an  analysis  has  not 
as  yet  been  done. 
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4.  Unlike  the  BH  algorithm,  the  GR  algorithm  utilizes  all  the  processors 
more  or  less  equally.  In  considering  the  replacement  of  a  connection 
network  by  fewer,  faster  processors  we  can  take  advantage  of  the  fact 
that  the  bottom  level  is  significantly  busier  than  the  rest  of  the  com¬ 
puting  elements;  using  the  GR  algorithm  all  the  processors  have  to  be 
accounted  for  and  replaced  (more  on  this  subject  in  the  sequel). 

7.3  Mapping  the  “pyramid”  into  a  hypercube 

The  mapping  of  the  network  of  processes  into  an  hypercube  is  preparation 
for  executing  the  algorithm  on  a  connection-machine-like  computer.  The 
interest  here  is  twofold.  How  fast  can  a  hypercube  machine  simulate  the 
network  and  how  can  the  n-body  computation  be  organized  on  the  hypercube. 
The  definitions  in  this  section  follow  [17].  The  actual  mapping  is  a  rather 
straightforward  generalization  of  [18]. 

An  embedding  <  <f>,p  >  of  a  graph  G  =  ( Vg,Eq )  into  a  graph  H  = 
( Vh,Eh )  is  defined  by  an  injective  mapping  from  Vq  to  Vh,  together  with 
a  mapping  p  that  maps  (u,u)  €  Eq  onto  a  path  p(4>(u),<f>(v))  in  H  that 
connects  <£(tt)  and  <p(v). 

The  quality  of  an  embedding  is  measured  by  three  cost  functions  —  ex¬ 
pansion,  dilation, and  load  factor.  The  expansion  of  an  embedding  <  <f>,p  > 
of  G  into  H  is  the  ratio  of  the  size  of  Vh  to  the  size  of  Vg-  The  dilation  of 
an  edge  (u,  u)  under  <  <j>,  p  >  is  the  length  of  the  path  p{<f>{u),  4>{v))  in  H. 
The  dilation  of  an  embedding  <  <j>,p  >  is  the  maximum  edge  dilation,  over 
all  edges  in  G,  under  <  4>,p  >  .  The  load  factor  A(e)  of  an  edge  e  in  H  is  the 
number  of  paths  that  pass  through  e  which  are  images  of  edges  in  G.  The 
load  factor  of  an  embedding  is  defined  to  be  the  maximum  load-factor  over 
all  edges  in  H. 

Consider  the  2D  case  first.  An  n  x  n  pyramid  graph  is  formed  from  meshes 
of  size  n  x  n,  j  x  j,  J  x  J,  j  x  J,  ...,  lxl  connected  as  implied  by  figure 
5.  Note  that  n  is  a  power  of  2  and  that  if  each  node  is  one  of  our  cells, 
this  pyramid  does  not  contain  the  connection  to  the  “diagonal”  neighbors. 
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(Omission  of  the  diagonal  link  does  not  change  d  in  either  the  two  or  three 
dimensional  cases.  As  it  is  shown  in  the  sequel,  d  is  the  important  quantity 
in  our  case.) 


Figure  5:  A  2  x  2  and  a  4  x  4  pyramid 

Lemma  7.1  An  n  xn  pyramid  can  be  embedded  in  a  2 n2  node  hypercube  so 
that  dilation  of  the  embedding  is  3  and  the  load- factor  is  2. 

See  [18]  for  a  proof. 

Lemma  7.2  Annxnxn  pyramid  can  be  embedded  in  a  2 n3  node  hypercube 
so  that  dilation  of  the  embedding  is  4  and  the  load-factor  is  3. 

The  lemma  can  be  proved  similarly  to  the  proof  in  [18]. 

Clearly,  we  are  interested  in  a  mapping  that  simulates  the  tree  well.  Since 
each  process  is  mapped  to  a  different  processor,  the  issue  is  an  issue  of  com¬ 
munication  time  rather  then  computation  time.  Maps  differ  from  each  other 
in  the  amount  of  communication  overhead  they  introduce  as  compared  to 
the  tree.  While  there  is  no  claim  that  the  above  mapping  is  optimal,  it  is 
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Figure  6:  A  simple  2D  embedding. 

attractive  since  neighboring  nodes  remain  close  together.  Moreover,  we  can 
bound  the  number  of  cycles  of  the  hypercube  machine  needed  to  simulate  a 
cycle  of  the  pyramid  machine. 

[17]  considers  the  case  of  simulating  G  by  H  where  all  nodes  sends  mes¬ 
sages  at  the  same  time  on  all  adjacent  wires.  The  resulting  bounds  are 
max{d,  A}  <  T  <  d\  where  T  is  the  number  of  cycles  of  H  required  to  sim¬ 
ulate  any  cycle  of  G.  These  bounds  are  too  pessimistic  for  our  case.  In  our 
case  an  upper  bound  is  simply  the  dilation  d.  This  is  indeed  not  the  bound 
on  a  general  computation  performed  on  a  pyramid  network  but  a  bound  on 
our  own  algorithm  with  the  above  mapping.  The  bound  results  from  the  fact 
that  in  this  mapping  the  only  overlapping  paths  are  those  from  a  node  to 
its  children.  Since  only  one  child  at  a  time  sends  messages  to  its  parent,  no 
conflict  arises.  Since  in  most  part  of  the  algorithms  the  information  consists 
of  several  messages  that  can  be  queued  inside  a  process,  it  is  believed  that  d 
is  actually  a  conservative  upper  bound. 

The  conclusion  is  that  hypercube  communication  can  be  bounded  by  d 
multiplied  by  the  communication  performance  of  the  pyramid.  We  believe 
that  this  is  a  rather  conservative  estimation  of  performance;  with  the  pyramid 
mapped  as  above  the  average  performance  of  the  connection  machine  will  be 
very  close  to  that  of  the  pyramid  machine. 

The  expansion  story  is  less  attractive.  A  32  x  32  x  32  base  (32K  cells  in 
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Figure  7:  A  simple  3D  embedding. 

the  bottom  level)  requires  64K  processors,  ^  of  them  not  participating  in 
the  computation. 

7.4  A  Small  Number  of  Larger  Processors 

This  section  considers  the  implementation  of  the  algorithm  by  a  small  num¬ 
ber  of  larger  processors.  Both  “small  number”  and  “larger”  are  with  respect 
to  the  connection  machine.  We  consider  either  1  processor,  or  ,  for  3D  arrays 
of  2  x  2  x  2,  4  x  4  x  4,  etc.  of  processors.  What  we  have  in  mind  for  a 
processor  is  a  board  with  fast  arithmetic  chips,  memory  and  a  controller. 
Currently,  a  board  like  this  cam  deliver  about  20M  floating  point  operations 
per  second  and  has  a  memory  of  several  million  words  of  32  bits  each.  The  in¬ 
terconnection  among  neighboring  processors  is  via  32-bit  buses.  To  facilitate 
simple  programming  and  symmetric  information  transfer  between  processors 
the  buses  are  interconnected  as  illustrated  by  figure  8.  The  processors  are 
associated  with  grid  points.  Each  processor  has  an  input  bus  and  an  output 
bus.  The  input  bus  of  a  processor  is  connected  (via  tri-state  gates)  to  the 
output  buses  of  all  its  neighbors.  The  output  bus  of  a  processor  is  connected 
(via  tri-state  gates)  to  the  input  buses  of  all  its  neighbors. 
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Figure  8:  Interconnection  among  neighboring  processors. 

The  pyramid  of  processes  has  to  be  partitioned  and  assigned  to  individual 
processors.  It  is  assumed  that  this  partition  is  symmetric;  The  base  of  the 
pyramid  is  equally  divided  among  the  processors,  each  getting  a  smaller 
pyramid;  whatever  is  left  is  divided  again,  or  assigned  to  one  processor.  For 
example,  a  32  x  32  base  pyramid  of  processors  is  divided  among  4  processors; 
Each  gets  a  16  x  16  base  pyramid;  the  top  process  remains  and  has  to  be 
assigned  to  one  of  the  processors. 

There  are  several  differences  between  this  computation  mode  and  the  ones 
discussed  before.  First,  no  messages  have  to  be  sent  between  two  processes 
that  resides  on  the  same  processors  (local  processes).  (A  process  accesses  the 
results  of  another  process  directly.)  Thus,  in  this  case,  the  communication 
time  is  negligible.  Second,  a  processor  evaluates  its  processes  sequentially; 
parallelism  exists  between  processors  only.  Thus,  the  evaluation  order  has  to 
be  such  that  no  processor  will  be  waiting  for  data  from  another  processor. 

The  transfer  of  information  between  processors  is  different.  It  is  assumed 
that  each  processor  has  a  DMA  channel  so  that  data  transfer  to  and  from 
buses  can  be  done  simultaneously  with  computing.  In  the  pyramid  case,  to 
save  “wires”,  information  was  routed  to  interactive  set  members  through  par- 
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ents.  Clearly,  this  does  not  make  any  sense  for  local  processes;  for  processes 
in  different  processors  the  issue  is  how  not  to  send  the  same  information  more 
than  is  necessary.  Therefore,  when  a  cell  has  a  result  which  is  needed  by  a 
cell  in  another  processor,  the  result  is  sent  over;  the  other  processor  has  a 
memory  cell  for  each  cell  on  other  processor  that  is  in  one  of  its  own  cells 
interactive  sets.  The  arriving  information  is  stored  there  to  be  used  by  all 
cells  that  need  it. 

Let  us  consider  a  pyramid  of  processes  with  a  base  of  n  x  n  x  n  (e,g., 
n  =  32)  and  an  array  (of  larger  processors)  of  size  kxkxk,  where  k  =  2, 4, 8, 
etc. 

The  pyramid  of  processes  is  partitioned  among  the  array’s  processors  so 
that  each  processor  gets  a  pyramid  of  base  ^  x  £  x  J .  The  top  of  the  original 
pyramid  is  not  covered;  In  the  case  of  base  32  pyramid,  which  is  used  here 
for  comparison,  and  arrays  of  base  2,  4,  and  8,  the  top  is  a  pyramid  with 
base  lxlxl, 2x2x2,  and  4x4x4,  respectively.  It  is  assumed  that  this 
tip  is  either  assigned  to  one  of  the  processors  or  divide  among  them.  For  the 
cases  k  —  2,4,8  the  contribution  of  the  tip  pyramid  to  the  total  time  can  be 
ignored  and  this  is  what  is  done  in  the  sequel. 

It  is  assumed  that  the  systems  of  processors  uses  the  same  algorithm 
as  in  section  4,  except  that  each  processor  services  all  processes  in  its  own 
pyramid.  The  processors  work  in  parallel  and  in  synchronization  so  that  no 
processor  has  to  wait  for  the  results  of  its  fellow  processors. 

It  is  not  difficult  to  see  that  each  processor  implements  somewhat  less 
than  f(f  )3  processes.  Similarly,  a  processor  has  to  implement  the  communi¬ 
cation  between  the  processes;  the  time  for  communication  between  processors 
residing  on  the  same  processor  can  be  ignored.  The  communication  time  be¬ 
tween  processors  has  to  be  accounted  for.  This  communication  is  represented 
by  “wires”  to  from  processes  on  the  pyramid  faces  to  neighboring  processes 
which  reside  on  neighboring  processors.  In  3D  the  number  of  these  wires  is 
proportional  to  (|)2  and  the  sequel  considers  their  effect  on  the  communica¬ 
tion  time. 

The  calculation  of  the  time  computation  is  based  on  20M  floating  point 
operations  per  seconds  and  the  communication  time  is  based  on  transfer  time 
of  80  nanoseconds  per  32  bits. 

Communication  time:  It  is  assumed  that  each  processor  has  pseudo¬ 
cells  ,  or  p-cells,  one  p-cell  for  each  cell  z,  such  that  z  is  a  member  of  an 


interactive  set  of  a  cell  in  the  processor  s  pyramid,  z  not  in  the  pyramid. 

The  number  of  such  cells  can  be  calculated.  For  a  cell  y,  y  a  cell  on  the 
pyramid  face,  these  are  the  children(neighbors(parent(y)))  which  are  not  in 
the  pyramid.  (Note  that  parent(y)  is  in  the  pyramid  and  that  we  are  inter¬ 
ested  in  neighbor  s(parent(y))  which  are  not  in  the  pyramid.)  If  we  “cover” 
the  pyramid  faces  with  such  cells,  this  cover  includes  all  the  interactive  set 
members  of  cells  in  the  pyramid  where  the  members  are  not  in  the  pyramid. 

The  number  of  cells  in  this  cover  is: 

n  n  n 

(number  of  children) x (number  of  cells  in  the  faces  of  a  —  x  — x  —  pyramid) 

2  k  Ik  2  k 

(63) 

The  number  of  cells  in  a  face  of  a  n  x  n  x  n  pyramid  is 

”2(1 + \ + h + ■") <  r*2  (64) 

Thus,  the  number  of  cells  in  the  above  cover  becomes 


8(^(f)Jx6  +  81og(^))  (65) 

The  last  term  are  cells  that  cover  the  corners.  This  last  term  can  be  just 
ignored. 

Thus,  the  number  of  p-cells  is  about  16(£)2  and  this  is  the  number  of 
messages  the  processor  has  to  send  or  to  receive  (ignoring  the  fact  that  some 
processors  do  not  have  neighbors  on  all  directions). 

The  GR  algorithm  defines  the  content  of  such  a  cell  by  nv  +  3  coefficients 
(the  3  represents  the  coordinates  of  the  source).  Thus,  the  communication 
time  is 

(»,  +  3)x32xl6(j)1(5£ii£L)  (66) 

Similarly,  the  BH  algorithm  defines  the  content  of  such  a  cell  by  1  + 
3  coefficients  (the  3  represents  the  coordinates  of  the  source).  Thus,  the 
communication  time  is 

4  x  32  x  16(^)V0-*32~-)  (67) 

Computation  time:  A  processor  that  implement  a  £  *  *  x  j  base 
pyramid  has  to  perform  all  the  operations  that  its  processes  do. 


For  the  3D  GR  algorithm  the  computation  time  is: 

38  8  8 

(7)  (T($n0)+-T(Tran3latet)+-nietT(Convertx)+-T(Shiftx)+T(Evaluation)) 
k  7  1  7 

(68) 

Using  equations  (39),  (43),  (46),  (49)  and  (50),  the  expression  becomes 

( £)  (np(4m  -  1)  +  3y fcp  +  3njelCp  +  4 yCp  +  3 mnp)t+  (69) 

which  results  in 

(£)%p(7 m  -  1)  +  *(MP  +  (3n.et  +  4)cp))t+.  (70) 

The  computation  time  of  the  3D  BH  algorithm  almost  entirely  consists 
of  the  evaluation  time.  Thus,  for  a  j  x  |  x  j  base  pyramid  the  computation 
time  is 

fl  3  1 

(—)  (evaluation  time  for  one  processor  of  the  pyramid)—— 
k  1000 

GR  algorithm  summary:  The  following  table  summarizes  the  compu¬ 
tation  and  communication  time  for  the  GR  algorithm  and  for  various  values 
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65 
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15.6 
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0.67 
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The  above  numerical  results  can  be  compared  with  the  the  performance  of 
the  GR  algorithm  on  the  pyramid  (section  7.2).  The  following  table  compares 
the  performance  for  p  =  3.  Define  Ratio 


Ratio  == 


( total  time  pyramid) 
(total  time  k  x  k  x  Jfc  array)' 
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P 

Ratio 
k  =  2 

Ratio 
k  =  4 

Ratio 
k  =  8 

32 

20 

640# 

3 

0.32 

1.75 

8.61 

38 

20 

1250 K 

3 

0.20 

1.16 

5.23 

BH  algorithm  summary:  The  following  table  summarizes  the  compu¬ 
tation  and  communication  time  for  the  BH  algorithm. 


n 
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particles 

k=2 

k=4 

comm-  comp- 

comm-  comp- 

time 

time 

time 

time 

sec 

sec 

32 

20 

640  K 

1.3 

126.4 

0.33 

16.0 

38 

20 

1250K 

1.84 

179.2 

0.46 

22.4 

k=8 

comm-  comp¬ 
time  time 
sec 

0.08  2.0 

0.11  2.4 


The  above  numerical  results  can  be  compared  with  the  the  performance 
of  the  BH  algorithm  on  the  pyramid  (section  7.1). 
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m 

particles 

Ratio 

k=2 

Ratio 

k=4 

Ratio 

k=8 

32 

20 

640K 

0.15 

1.21 

9.8 

38 

20 

1250K 

0.14 

1.25 

9.12 

Summary:  It  follows  from  the  previous  section  that 

computation  time  pyramid  +  communication  time  pyramid  < 

total  time  hypercube  <  (74) 

computation  time  pyramid  +  d  x  ( communication  time  pyramid). 

where  d  =  4.  Bounds  for  the  ratio  of  the  total  time,  hypercube,  to  the  total 
time,  array,  can  be  derived  from  this  relation.  Since  I  believe  that  d  =  4 
is  a  rather  conservative  bound,  I  am  giving  the  hypercube  the  benefit  of 
the  doubt  and,  for  comparison  sake,  consider  its  performance  equal  to  the 
pyramid. 

The  above  numerical  results  indicates  that  the  array  can  comfortably 
compete  with  either  the  pyramid  or  the  hypercube  (both  with  the  CM2 
connection  machine  parameters).  A  general  comparison  of  the  array  and 
the  hypercube  has  to  weight  two  additional  factors:  First,  the  hypercube, 
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8  Near-field  Computation 


This  section  estimates  the  computation  and  communication  time  required 
for  near- field  computation.  These  estimates  are  added  to  the  estimates  of 
the  far-field  computation  to  yield  an  estimate  for  the  total  time  required. 

Let  q  be  a  point  mass  at  some  cell  n.  The  near-field  force  on  q  is  the  force 
exerted  on  q  by  particles  in  n  an  in  neighbors(n).  This  force  is  computed 
using 

ci  _r,  MxMy  Xi-yi 


FXi  =  G- 


x  -  y 


x-y 


where  Mx  and  My  are  the  two  mass  points  in  3D  whose  coordinates  are 
x  =  (xi,X2,i3)  and  y  =  (3/1,5/2,173),  respectively,  and  FXi  is  the  force  on  Mx 
in  the  x,  direction.  This  calculation  is  rather  straight  forward  and  the  only 
issue  is  how  to  take  advantage  of  the  symmetry  of  the  force  (FXi  =  —FXt) 
which,  in  parallel  computation,  requires  some  elaboration. 

Consider  the  pyramid  computational  structure.  Only  the  bottom  layer 
processes  participate  in  the  near-field  computation.  The  following  is  a  step 
by  step  description  of  this  computation. 

•  For  all  cells  c  in  the  bottom  layer  evaluate  the  forces  on  each  mass  in 
c  due  to  the  other  mass  points  in  c. 

The  evaluation  of  equation  75  for  all  force  components  requires  2t+  + 
6f_  +23*.  +  */  =  36*+.  There  are  \m{m  —  1)  such  computations;  Under 
the  assumptions  stated  in  section  6  the  total  time  is  |m(m  —  1)*+. 

•  Consider  a  cell  c  and  its  neighbors,  c  and  each  of  its  neighbors,  say  e, 
determine  a  direction.  This  direction  is  determined  by  xe  —  xc  where 
xe  is  the  center  of  the  neighbor  cell  and  xc  is  the  center  of  c.  Clearly, 
if  xe  —  xe  is  a  direction,  so  is  xc  —  xe.  Let  D  be  a  set  of  directions  such 
that  if  d  is  a  direction  the  either  d  €  D  or  —  d  €  D  but  not  both. 

For  all  directions  d  in  D  do 

-  Each  bottom  layer  cell  c  sends  the  value  and  coordinates  of  each 
of  its  mass  points  to  its  neighbor  in  the  d  direction. 
Communication  time  is  m  message  unites. 


-  Each  cell  c  provides  3m  locations  for  the  forces  on  the  mass  points 
whose  values  and  coordinates  it  just  received.  Next  it  calculates 
the  forces  between  any  two  pairs  of  mass  points,  one  in  c  and  one 
that  its  values  just  arrived  from  the  neighbor. 

Computation  time  is  36m(m  —  l)t+ 

-  The  values  accumulated  in  the  3m  locations  are  sent  back  to  the 
neighbor  cell  from  which  the  mass  points  came. 

Communication  time  is  m  message  units. 


Since  the  number  of  directions  in  D  is  ^  the  toted  time  is 


36  fj 

(ym(m  -  l)t+  +  y(36m(m  -  1)4  +  2 mtme,Mge  unit) 


Since  the  communication  time  is  small  as  compared  to  the  computation 
time  we  (almost)  get  the  1/2  factor  discussed  in  the  beginning  of  this  section. 
This  is  achieved  by  having  one  processor  calculate  the  force  and  send  the 
results  back  to  its  neighbor. 

Using  the  CM2  connection  machine  parameters  once  again  (20k  floating 
point  operations  per  second,  250  microseconds  per  message)  the  time  for 
evaluating  the  near-field  by  the  pyramid  is  given  below  (n„  is  26  for  3D), 
m  communication  time  computation  time 
n„m  36(^  +  l)m(m  —  1)4 


10  65  2,268 

20  130  9,576 

The  array  of  larger  processors  is  faring  just  as  well  in  the  near-field  com¬ 
putation.  As  seen  from  the  above,  communication  time  is  negligible.  Since 
only  the  bottom  layer  processors  of  the  pyramid  participate  in  the  calculation 
we  get  for  an  n  x  n  x  n  base  pyramid 


pyramid  time  1 


k  x  k  x  k  processor  array  time  (ffilpeedup 


4  for  a  pyramid  processor 


4  for  a  larger  processor 
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9  Summary  and  Conclusions 

The  work  formulates  the  N-body  problem  as  a  set  of  recursive  equations 
with  boundary  conditions.  The  numerical  solution  to  this  set  of  equations 
can  be  viewed  as  a  pyramid  structure  of  processes.  There  are  three  kinds  of 
processes:  the  bottom  level  processes,  the  top  level  processes  and  all  the  rest. 
All  processors  are  implemented  by  means  of  five  functions:  Translate,, 

Convert z,  Shift,,  Evaluate.  Both  the  BH  and  the  GR  algorithms  have 
the  same  structure.  They  differ  from  each  other  in  the  above  functions. 
This  formulation  of  the  N-body  problem  simplifies  the  understanding  of  the 
algorithms,  their  analysis  and  programming. 

The  pyramid  structure  is  mapped  into  three  different  architectures  that 
can  be  viewed  as  possible  hardware  implementations.  For  each  mapping  the 
amount  of  time  for  arithmetic  computations  and  the  time  for  communication 
between  processors  is  estimated.  Out  of  these  estimates  several  qualitative 
conclusions  are  derived: 

Parallel  execution  reduces  both  the  BH  and  the  GR  algorithms  to  O(log  N), 
where  N  is  the  number  of  particles;  this  bound  covers  both  computation  and 
communication  time.  This  is  somewhat  of  a  surprise,  because  in  the  serial 
computation  the  BH  algorithm  is  considered  a  0(N  log  N)  algorithm  while 
the  GR  one  is  O(N). 

The  algorithms  exhibit  both  common  and  differing  patterns  of  behavior. 
In  both  cases,  given  the  the  height  of  the  tree  h,  the  time  spent  in  the  tree 
is  independent  of  m,  the  number  of  particles  in  a  cell.  The  evaluation  stage, 
however,  is  hard  on  BH;  for  large  m  most  of  the  time  is  spent  in  it,  and 
computation  time  far  exceeds  communication  time.  For  large  m  the  GR 
algorithm  performs  exceedingly  well;  the  evaluation  of  the  potential  function 
for  each  particle  is  straightforward.  Clearly,  using  a  38  x  38  x  38  pyramid, 
the  GR  algorithm  can  handle  10M  equally  distributed  particles  almost  as 
fast  as  it  handles  1M.  At  these  numbers,  the  increase  of  m  from  20  to  200 
increases  the  total  time  by  about  25  percent  only  because  the  algorithm  is 
bounded  by  the  communication  time  rather  than  by  the  evaluation  time  at 
the  bottom  cell. 

Since  the  BH  algorithm  is  much  simpler  to  program  it  is  worthwhile  noting 
that  its  performance  for  62K  particles,  one  at  a  cell,  is  about  equal  to  that 
of  the  GR  algorithm  when  p  =  2,  using  the  same  number  of  particles.  A 
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complete  comparison  of  BH  and  GR  requires  comparing  the  algorithms  for 
the  same  global  error.  This  work  has  not  been  completed  as  yet. 

A  hypercube  machine  can  simulate  the  pyramid  structure  without  conflict 
of  messages  and  with  a  slowdown  equal  to  or  less  than  d  =  4. 

Using  the  GR  algorithm  with  p  =  3  and  a  connection  machine  of  the 
present  size,  it  is  possible  to  calculate  a  solution  to  the  1M  particle  (static) 
many  body  problem  in  about  24  seconds.  This  number  is  based  on  multiply¬ 
ing  d  =  4  by  the  communication  time  computed  for  the  tree  and  adding  the 
computation  times  for  the  far-  field  and  for  the  near-field.  It  is  considered  a 
rather  conservative  estimate.  This  number  is  also  the  estimate  for  one  time 
step  for  the  dynamic  many  body  problem. 

For  running  the  N-body  algorithm  the  connection  machine  has  a  clear 
advantage:  it  is  an  existing  product  that  can  be  readily  used.  The  analysis, 
however,  points  out  some  possible  disadvantages;  while  64K  processors  seem 
quite  a  large  number,  the  3D  N-body  problem  reaches  the  limit  rather  fast. 
One  has  to  increase  m  ,  assigning  more  particles  per  cell.  Clearly,  at  least  for 
the  BH  case,  we  would  rather  increase  the  number  of  processors;  a  feature 
the  hardware  does  not  support. 

Rough  performance  estimates  for  an  array  of  faster  processors  indicate 
that  with  a  modest  size  array  it  is  possible  to  get  performance  competitive 
to  that  of  the  connection  machine. 

The  above  calculations  considered  particles  that  are  equally  distributed 
in  space.  Often,  physical  system  exhibit  sparsity,  certain  regions  of  space 
have  many  particles  while  others  are  empty  or  nearly  so.  Taking  advantage 
of  sparsity  awaits  farther  work. 

The  computation  and  the  communication  times  computed  should  be  con¬ 
sidered  initial  estimates  as  data  moving  operations  were  not  taken  into  ac¬ 
count.  Moreover,  these  estimates  await  experimental  verification,  preferably 
by  programs  simulating  the  above  processes.  This  work  is  currently  under¬ 
way. 
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